Avoid crash on a second epoch over a skip/take IterableDataset#4081
Open
robbiebusinessacc wants to merge 1 commit into
Open
Conversation
`DataLoaderShard`/`DataLoaderDispatcher` call `set_epoch(self.iteration)` at the start of each pass, forwarding a nonzero epoch to a wrapped HF `datasets.IterableDataset`. For datasets built via `.skip()`/`.take()` this makes iteration raise `DataSourcesShufflingDisallowed` on the second epoch, since those datasets forbid reshuffling their data sources between epochs. Catch the exception while iterating and reset the dataset epoch to 0 so such datasets can still be iterated across multiple epochs; per-epoch source reshuffling stays available for datasets that support it. Adds a regression test covering both the dispatch and non-dispatch paths. Fixes huggingface#4080
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this fixes
Iterating a prepared HF
datasets.IterableDatasetfor more than one epoch crasheswith
DataSourcesShufflingDisallowedwhen the dataset was built via.skip()/.take()(#4080).DataLoaderShard.__iter__andDataLoaderDispatcher.__iter__callself.set_epoch(self.iteration)at the start of every pass, forwarding a nonzeroepoch to the wrapped dataset. On the second epoch the dataset tries to reshuffle its
data sources at iteration time, which
.skip()/.take()iterables forbid, so itraises
DataSourcesShufflingDisallowed.The fix
Catch
DataSourcesShufflingDisallowedwhile iterating and reset the dataset's epochto 0, then re-create the iterator. These datasets can't reshuffle their sources
between epochs anyway, so this lets them iterate across multiple epochs without
changing behaviour for datasets that do support per-epoch reshuffling. The exception
is imported defensively so
datasetsstays optional.Repro (from the issue)
Tests
Adds
test_iterable_dataset_blocked_source_shuffling_multiple_epochs, covering boththe dispatch and non-dispatch paths. It fails on
mainand passes with the fix; therest of
tests/test_data_loader.pyis unaffected.Fixes #4080